by Lieby Cardoso

This data set was simulated and published on Kaggle by Ludovic Benistant. Source: Kaggle - Ludovic Benistant - Hr Analytics

Attributes:

  1. Satisfaction_level: Employee satisfaction level
  2. Last_evaluation: Score received in the last evaluation
  3. Number_project: Number of projects worked on
  4. Average_montly_hours: Average monthly hours worked
  5. Time_spend_company: Years at the company
  6. Work_accident: 0 - No accident at work, 1 - Had an accident at work
  7. Left: Departure indicator, 0 - Still employed, 1 - Left the company
  8. Promotion_last_5years: Promotion indicator, 0 - No, 1 - Yes
  9. Sales: Department
  10. Salary: Salary range (low, medium or high)

Objective

The objective of this project is to explore the variables of the hr data set that can help identify the factors that led some employees to leave the company. The investigation will answer these questions:

  1. Did the employees who left receive a low salary?
  2. Is the employee’s dissatisfaction perceptible?
  3. Was being overworked a negative factor?

##  [1] "satisfaction_level"    "last_evaluation"      
##  [3] "number_project"        "average_montly_hours" 
##  [5] "time_spend_company"    "Work_accident"        
##  [7] "left"                  "promotion_last_5years"
##  [9] "sales"                 "salary"

Data overview

Summary:

##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.4400     1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:156.0       
##  Median :0.6400     Median :0.7200   Median :4.000   Median :200.0       
##  Mean   :0.6128     Mean   :0.7161   Mean   :3.803   Mean   :201.1       
##  3rd Qu.:0.8200     3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:245.0       
##  Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##                                                                          
##  time_spend_company Work_accident         left       
##  Min.   : 2.000     Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 3.000     1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 3.000     Median :0.0000   Median :0.0000  
##  Mean   : 3.498     Mean   :0.1446   Mean   :0.2381  
##  3rd Qu.: 4.000     3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :10.000     Max.   :1.0000   Max.   :1.0000  
##                                                      
##  promotion_last_5years         sales         salary    
##  Min.   :0.00000       sales      :4140   high  :1237  
##  1st Qu.:0.00000       technical  :2720   low   :7316  
##  Median :0.00000       support    :2229   medium:6446  
##  Mean   :0.02127       IT         :1227                
##  3rd Qu.:0.00000       product_mng: 902                
##  Max.   :1.00000       marketing  : 858                
##                        (Other)    :2923

The data looks consistent: judging by the minimum, mean, median and maximum values, this initial analysis shows no evidence of outliers.

We will check the structure of the data, and whether it will be necessary to manipulate the types of variables to aid analysis.

str(hr)
## 'data.frame':    14999 obs. of  10 variables:
##  $ satisfaction_level   : num  0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
##  $ last_evaluation      : num  0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
##  $ number_project       : int  2 5 7 5 2 2 6 5 5 2 ...
##  $ average_montly_hours : int  157 262 272 223 159 153 247 259 224 142 ...
##  $ time_spend_company   : int  3 6 4 5 3 3 4 5 5 3 ...
##  $ Work_accident        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ left                 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ promotion_last_5years: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sales                : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ salary               : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...

The data set consists of 14,999 observations and 10 variables. It has no variable that uniquely identifies the employee, so we will assume that each record corresponds to one person, without duplication.

# Order the salary levels from low to high
hr$salary <- ordered(hr$salary, levels = c("low", "medium", "high"))

The employee satisfaction level will be grouped into three bands: YES, NO and REGULAR.

# Create new field satisfied
hr$satisfied <- as.factor(ifelse(hr$satisfaction_level<=.40, "NO", 
                                 ifelse(hr$satisfaction_level<=.70, "REGULAR", 
                                        "YES")))

Checking the created fields:

table(hr$satisfied)
## 
##      NO REGULAR     YES 
##    3124    5577    6298
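The same banding can also be expressed with `cut()`. The sketch below is only an illustration on a hypothetical vector of scores (not the hr data), assuming the band edges used above: up to 0.40 is NO, above 0.40 and up to 0.70 is REGULAR, and above 0.70 is YES.

```r
# Hypothetical satisfaction scores, chosen to sit on and around the band edges
satisfaction_level <- c(0.09, 0.40, 0.41, 0.70, 0.71, 1.00)

# cut() uses right-closed intervals by default:
# (-Inf, 0.40], (0.40, 0.70], (0.70, Inf)
satisfied_cut <- cut(satisfaction_level,
                     breaks = c(-Inf, 0.40, 0.70, Inf),
                     labels = c("NO", "REGULAR", "YES"))

# The nested ifelse() used in the report, for comparison
satisfied_ifelse <- ifelse(satisfaction_level <= 0.40, "NO",
                           ifelse(satisfaction_level <= 0.70, "REGULAR", "YES"))

# Both approaches assign every score to the same band
identical(as.character(satisfied_cut), satisfied_ifelse)
```

Note that a score of exactly 0.70 falls in REGULAR under both versions, so YES means strictly greater than 0.70.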

Are there any missing values? Let’s count the NA values in each variable.

# Count NA values per column
colSums(is.na(hr))
##    satisfaction_level       last_evaluation        number_project 
##                     0                     0                     0 
##  average_montly_hours    time_spend_company         Work_accident 
##                     0                     0                     0 
##                  left promotion_last_5years                 sales 
##                     0                     0                     0 
##                salary             satisfied 
##                     0                     0

The data set is completely populated, so no missing values need to be handled.

Univariate Graphics Section

In this section, the variables are split into two groups: the first covers the information that helps us characterize the company, and the second covers the variables that help us understand the profile of the employees who remain and of those who left the company.

How is the company?

Proportion of people who left:

prop.table(table(hr$left))
## 
##         0         1 
## 0.7619175 0.2380825

The variable left, the focus of this analysis, separates employees who are still at the company (0) from those who left (1). Of the 14,999 employees, 3,571 left and 11,428 remain, so those who left make up 24% of the data set.

The distribution of years at the company is positively skewed: the number of employees per year falls as it approaches the right tail. Let’s check this trend by confirming, in the statistical summary of the variable time_spend_company, that the median is less than the mean:

summary(hr$time_spend_company)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   3.000   3.498   4.000  10.000

The median is 3 years, which is also the year with the largest number of employees, 6,443 people. Most employees in the data set (75%) worked between 2 and 4 years, since the 3rd quartile is 4 years.

This was unexpected to me. Of the 14,999 employees in the data set, 2,169 suffered an accident at work, or 14% of the total. For comparison, according to data from Fundacentro - Ministry of Labor, in 2013 there were 717,911 injured people registered in the pension plan, out of a total of 96 million people employed. This gives a rate of 0.75%, well below the 14% recorded by our company.
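The two rates quoted above can be reproduced with simple arithmetic. This is a small sketch using only the figures from the text; the variable names are illustrative.

```r
# Figures quoted in the text
accidents_dataset   <- 2169     # employees with a work accident in the hr data
employees_dataset   <- 14999    # total employees in the hr data
accidents_reference <- 717911   # injured people registered in 2013
employed_reference  <- 96e6     # people employed (approximate)

# Accident rates as percentages
rate_dataset   <- 100 * accidents_dataset / employees_dataset
rate_reference <- 100 * accidents_reference / employed_reference

round(c(dataset = rate_dataset, reference = rate_reference), 2)
```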

The monthly average hours worked has a bimodal distribution, with a peak between 140 and 155 hours and another between approximately 245 and 265 hours.

Statistical summary of the variable average_montly_hours:

summary(hr$average_montly_hours)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    96.0   156.0   200.0   201.1   245.0   310.0

# Group data by sales
Funcionarios_departamento <- aggregate(hr$sales, by=list(hr$sales),
                                       FUN = length)

formattable(Funcionarios_departamento, 
            list(x = color_tile("white", "#6699CC")))
Group.1          x
accounting     767
hr             739
IT            1227
management     630
marketing      858
product_mng    902
RandD          787
sales         4140
support       2229
technical     2720

The Sales department has the largest number of employees, followed by the Technical and Support departments; together the three total 9,089 employees. Management is the smallest, with 630 employees.

Very few people were promoted: only 2% of employees received a promotion in the last five years. Knowing the career plan policy would be important to understand this shortage of promotions.

Employee profile

There is no record of anyone working on a single project. The employees worked on at least 2 and at most 7 projects.

summary(hr$number_project)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   4.000   3.803   5.000   7.000

On average, people worked on about 4 projects (mean 3.8, median 4).

sum(hr$last_evaluation >= 0.7)
## [1] 8015

The company is made up mostly of employees with a medium to high evaluation. For this analysis, a score equal to or greater than 0.7 is considered high. By that measure, 8,015 employees are well evaluated by the company, which corresponds to 53% of the people.

sum(hr$satisfaction_level <= 0.3)
## [1] 1941
sum(hr$satisfaction_level >= 0.7)
## [1] 6503

Some employees have a low level of satisfaction: 1,941 people had a satisfaction level of 0.3 or less. At the other end of the distribution, 6,503 employees had a satisfaction level equal to or above 0.7.

The satisfaction bands were created to provide one more element of comparison between those who remain and those who left. For the purpose of this analysis, the bands are:

  1. NO = satisfaction level less than or equal to 0.4
  2. REGULAR = satisfaction level greater than 0.4 and up to 0.7
  3. YES = satisfaction level greater than 0.7

# Satisfaction proportion grouped by left
prop.table(table(subset(hr, select=c(satisfied, left))),2)
##          left
## satisfied         0         1
##   NO      0.1215436 0.4858583
##   REGULAR 0.4089954 0.2528703
##   YES     0.4694610 0.2612714

When we compare the proportions between those who remain and those who left, we notice that among those who remain the satisfied and regular bands prevail, together corresponding to 88% of this group. Among those who left, the dissatisfied band alone corresponds to 49%. It is striking that about half of those who left were not satisfied.

Among those who remain, the satisfied form the largest group, followed by those with regular satisfaction; dissatisfied individuals are the minority.

Univariate Analysis

What is the structure of the dataset?

The data structure consists of 14,999 observations and the 10 variables imported from the original file. The satisfied variable was added to represent employee satisfaction in 3 bands (YES, REGULAR, NO).

The variable left represents the employment status, with the domain: 0 = people who are still employed, 1 = people who have left the company.

In the data set we have 8 numerical variables and 2 categorical variables. The categorical variable sales has 10 levels and salary has 3.

What are the main attributes of interest in this data set?

The main variable of interest in the set is left, which tells you whether the employee stayed in the company or left. The satisfaction level variable (satisfaction_level) will also be a focus of the analysis, since the correlation of the other variables with it can help to understand the reasons that led people to leave the company.

What other attributes do you think can help you in researching these attributes of interest?

  1. Promotion_last_5years: A lack of promotions can generate dissatisfaction if the employee feels that their work was not recognized;

  2. Time_spend_company: Time spent in a company can create a desire to leave, not always for a negative reason; sometimes it is motivated by the wish to take on new challenges;

  3. Salary: This variable is categorical, with the values low, medium and high. The relationship between pay, satisfaction and the decision to leave will be evaluated, although several surveys indicate that pay is not always a major influence on this decision. It is possible that variables like the number of promotions and the score received in the last evaluation weigh more on the feeling of having one's work recognized, which is an important factor;

  4. Number_project: This variable holds the total number of projects the person worked on. We do not know whether it is the total for a certain period or for the whole stay in the company, but we can still explore questions such as: were people with more projects more satisfied, or did they feel overwhelmed and leave the company?

  5. Sales: Is there a problem department with a low level of satisfaction? I believe this question is worth investigating and may affect the final result.

Have you created new variables from the existing attributes in the data set?

The data set was very consistent, and only one change was made.

The satisfied variable, a factor with 3 levels (YES, NO and REGULAR), was created. The goal of this field is to make the employees' satisfaction ranges easier to visualize. The ranges were defined for this project without any commitment to the standard values of a human resources department.

Of the attributes investigated, were unusual distributions found? Have you applied operations on the data to clean them, set them or change the shape of the data? If yes, why?

I searched for NA values, but the data set had none. Empty values were not searched for separately because, when importing the file, I mapped them to NA (na.strings = c("", "NA")), so searching for NA already covered the empty values.
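As an illustration of how `na.strings` behaves at import time, the sketch below reads a tiny hypothetical CSV from a string (the real file and its path are not shown in this report) and counts the resulting NA values per column.

```r
# Hypothetical CSV content with an empty numeric field, an empty text field
# and a literal "NA"
csv_text <- "satisfaction_level,salary
0.38,low
,medium
0.80,
0.11,NA"

# Both empty strings and the text "NA" are read as missing values
toy <- read.csv(text = csv_text, na.strings = c("", "NA"))

# Same check used in the report: number of NA values per column
na_counts <- colSums(is.na(toy))
na_counts
```

With this setting, a single `is.na()` pass is enough to catch both kinds of missing entry.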

In some graphs, I converted the left variable to a factor so that the data is grouped into 0 and 1 rather than averaged over these two numbers.

Looking at the summary of the variables, the data is consistent and without outliers, so it was not necessary to manipulate the content.

I reordered the salary content so that when printing the graphics I was able to demonstrate the data sequentially (Low => Medium => High).
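The effect of the reordering can be seen on a small hypothetical vector (not the hr data): by default factor levels are sorted alphabetically, while `ordered()` with explicit levels gives the low to high sequence used in the plots.

```r
# Hypothetical salary values
salary <- factor(c("low", "high", "medium", "low"))

# Default level order is alphabetical: "high" "low" "medium"
levels(salary)

# Same call as in the report, applied to the toy vector
salary_ordered <- ordered(salary, levels = c("low", "medium", "high"))
levels(salary_ordered)

# An ordered factor also supports comparisons between levels
below_high <- salary_ordered < "high"
below_high
```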

The variable salary, whose content is the salary range (high, medium, low), would be a great candidate for a log10 transformation if it held the monetary value of the salary.

Section of Bivariate Graphs

Now that we know a little about the variables of the data set, we will check how they relate to each other. In this section we will try to answer the questions described at the beginning of the project.

Did those who left receive a low salary?

When we view the salary bands by department, grouped by employment status, we can see the difference in the distribution of salary bands between those who remain and those who left. Among those who left we noticed few occurrences of high salaries, most of them in the Technical and Sales departments. In the Management department no employees with high salaries were recorded among those who left, yet among those who remain this is the department where most employees receive a high salary. This pattern is not repeated in other departments.

Investigating the department a bit more, we will look at the proportion of people who left each department.

sales          0   1
accounting    73  27
hr            71  29
IT            78  22
management    86  14
marketing     76  24
product_mng   78  22
RandD         85  15
sales         76  24
support       75  25
technical     74  26
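A table of this kind can be built with `prop.table()` over a department by left cross-tabulation. The code that produced the table above is not shown, so this is only a sketch on a hypothetical mini data set.

```r
# Hypothetical mini data set; the real hr data has 14,999 rows
toy <- data.frame(
  sales = c("hr", "hr", "hr", "hr", "IT", "IT", "IT", "IT", "IT"),
  left  = c(0, 0, 0, 1, 0, 0, 0, 0, 1)
)

# margin = 1 normalizes each row, so every department sums to 1
left_by_dept <- prop.table(table(toy$sales, toy$left), margin = 1)

# Express as whole percentages per department
round(100 * left_by_dept)
```

Normalizing by row rather than by column is what makes departments of very different sizes comparable.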

The management and research and development (RandD) departments had the lowest employee turnover. In the other departments, 22% to 29% of employees left the company. Below we can see this distribution:

We found a very similar distribution across departments among those who remain: they worked from 150 to 250 hours per month. Among those who left, the distribution is wider, with minimums below 150 and maximums above 250 monthly hours. In the previous graphs we saw that they tend to earn less, and from that information alone I imagined that they might work less. Contrary to what I thought, they worked longer hours, especially in the RandD department, where a higher average was recorded. In the marketing and hr departments, people worked between 150 and 250 hours, but with a lower median.

I will create a new dataset with the average hours worked by each department grouped by left, so we can confirm these findings:

Sales         Remain     Left
accounting    199.0373   207.0294
hr            199.2500   197.3070
IT            198.8868   213.8498
management    200.2338   207.2637
marketing     198.8885   200.9901
product_mng   197.7656   207.7879
RandD         198.9520   210.9752
sales         199.5717   205.0414
support       199.1410   205.6360
technical     198.4711   214.1836

With the exception of the human resources department (hr), employees who left the company worked more hours on average than those who remained in the same department, with a monthly average of over 200 hours. The cases where the group that left worked more hours were highlighted in red.
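One way to obtain group means like these is `tapply()` with two grouping variables. The original aggregation code is not shown, so the sketch below uses a hypothetical mini data set and illustrative column labels.

```r
# Hypothetical mini data set with the same column names as hr
toy <- data.frame(
  sales = c("hr", "hr", "hr", "hr", "IT", "IT", "IT", "IT"),
  left  = c(0, 0, 1, 1, 0, 0, 1, 1),
  average_montly_hours = c(200, 198, 190, 204, 195, 205, 210, 220)
)

# Rows are departments, columns are the values of left (0 then 1)
mean_hours <- tapply(toy$average_montly_hours,
                     list(toy$sales, toy$left),
                     mean)
colnames(mean_hours) <- c("Remain", "Left")
mean_hours

# Departments where those who left worked more hours on average
more_hours_left <- rownames(mean_hours)[mean_hours[, "Left"] > mean_hours[, "Remain"]]
more_hours_left
```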

Summary statistics: average_montly_hours x left:

## 
##     0     1 
## 11428  3571
## hr$left: 0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    96.0   162.0   198.0   199.1   238.0   287.0 
## -------------------------------------------------------- 
## hr$left: 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   126.0   146.0   224.0   207.4   262.0   310.0

We know that the employees who left worked longer hours for lower salaries. Did they manage to stay in the company for many years?

Once again we have a different distribution for those who remain and those who left. The data set contains people who have worked in the company between 2 and 10 years. For those who worked a total of 2 years, we noticed the smallest difference between the two groups, but even in this group those who left tended to work more hours. The data for those who worked 3 years is quite different from the others: it shows a new cluster of people who left and worked fewer hours than the monthly average. In the univariate graphs section, when we plotted the number of people per years worked, we observed a peak in the 3rd year, followed by a fall in the number of people per year after the 4th year.

summary(hr$average_montly_hours)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    96.0   156.0   200.0   201.1   245.0   310.0

Average hours worked in the month for the total number of years worked:

Years worked   Remain     Left
2              199.9564   210.7736
3              199.4515   147.3770
4              198.8122   269.6124
5              192.5766   246.3373
6              199.6542   242.2440
7              200.7447   NA
8              193.8025   NA
10             199.2243   NA

We have no records of people who left and who have worked for 7 years or more.

We see dissatisfaction among employees who left the company in the 3rd or 4th year. People who left the company in the 5th year mostly showed either low or high satisfaction. Most of those who worked at the company for 6 years showed a high satisfaction level, above 0.75. Ideally we would have a year-by-year history of these people's satisfaction levels, so we could identify the year from which dissatisfaction tends to increase or decrease.

Is there a similar variation in the evaluation grade received by this employee during the years worked? To find out this, we will see the evaluation grade per year worked:

Again the 3rd year appears as a dividing line. The people who left the company in the 3rd year received a low score in the last evaluation, which does not repeat from the 4th to the 6th year, where the majority of those who left received scores higher than 0.8.

Let’s ask another question: do people with more projects work more hours?

We have no information on the complexity of the projects or on each person's share of participation in them. This distribution surprised me somewhat, and knowing a little more about the projects would help us understand the following points:

  1. Among the lowest average monthly hours worked, there is a group of people who worked on 6 projects, out of a maximum of 7 projects reported.
  2. Between 100 and 200 hours, that is, at or below the average of about 200 hours for this variable, there is also a group that worked on more than 4 projects.
  3. People with 7 projects worked above the average of 200 hours, recording an average above 240 hours monthly.

I thought I could identify some department that was overloading its employees with many projects, but all departments have people who have worked on 4 projects or more. Contrary to my expectation, people with 7 projects were recorded in all departments.

Did those who worked on more projects earn more?

People in all three salary ranges (high, medium and low) worked on 2 projects or more. 144 people worked on 7 projects and received a low salary. So, answering the question, I saw no evidence that a larger number of projects comes with a better salary range.

Even though they have suffered an accident at work, the employees remain in the company. This variable does not seem to be related to the employee’s exit.

Now that we know the importance of the variables satisfaction_level and average_montly_hours, we will calculate the Pearson correlation for the variables of the hr data set:

The larger and darker the symbol, the more correlated the variables are; orange indicates negative correlation values and blue positive ones. Let’s confirm by calculating the values:

##                       satisfaction_level last_evaluation number_project
## satisfaction_level            1.00000000     0.105021214   -0.142969586
## last_evaluation               0.10502121     1.000000000    0.349332589
## number_project               -0.14296959     0.349332589    1.000000000
## average_montly_hours         -0.02004811     0.339741800    0.417210634
## time_spend_company           -0.10086607     0.131590722    0.196785891
## Work_accident                 0.05869724    -0.007104289   -0.004740548
## left                         -0.38837498     0.006567120    0.023787185
## promotion_last_5years         0.02560519    -0.008683768   -0.006063958
##                       average_montly_hours time_spend_company
## satisfaction_level            -0.020048113       -0.100866073
## last_evaluation                0.339741800        0.131590722
## number_project                 0.417210634        0.196785891
## average_montly_hours           1.000000000        0.127754910
## time_spend_company             0.127754910        1.000000000
## Work_accident                 -0.010142888        0.002120418
## left                           0.071287179        0.144822175
## promotion_last_5years         -0.003544414        0.067432925
##                       Work_accident        left promotion_last_5years
## satisfaction_level      0.058697241 -0.38837498           0.025605186
## last_evaluation        -0.007104289  0.00656712          -0.008683768
## number_project         -0.004740548  0.02378719          -0.006063958
## average_montly_hours   -0.010142888  0.07128718          -0.003544414
## time_spend_company      0.002120418  0.14482217           0.067432925
## Work_accident           1.000000000 -0.15462163           0.039245435
## left                   -0.154621634  1.00000000          -0.061788107
## promotion_last_5years   0.039245435 -0.06178811           1.000000000

For the left variable, the following correlations were found:

  1. Satisfaction_level = As expected, this is the variable with the strongest correlation with left, with a value of -0.39
  2. Work_accident = Weak correlation, value -0.15
  3. Time_spend_company = Weak correlation, value 0.14
  4. Average_montly_hours = Very weak correlation, value 0.07
  5. Promotion_last_5years = Very weak correlation, value -0.06
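The ranking above can be read off the `left` column of the correlation matrix. The sketch below shows one way to do that on a hypothetical mini data set (not the hr data); the selection of numeric columns is an assumption about how the matrix was built.

```r
# Hypothetical mini data set with one non-numeric column
toy <- data.frame(
  satisfaction_level   = c(0.9, 0.8, 0.2, 0.1, 0.7, 0.3),
  average_montly_hours = c(180, 200, 270, 140, 190, 260),
  left                 = c(0, 0, 1, 1, 0, 1),
  salary               = factor(c("low", "high", "low", "low", "medium", "medium"))
)

# Keep only numeric columns; cor() computes Pearson correlations by default
numeric_cols <- sapply(toy, is.numeric)
cor_matrix <- cor(toy[, numeric_cols])

# Correlations with left, dropping left's correlation with itself
cor_with_left <- cor_matrix[, "left"]
cor_with_left <- cor_with_left[names(cor_with_left) != "left"]

# Sort by absolute strength, strongest first
cor_with_left[order(abs(cor_with_left), decreasing = TRUE)]
```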

A little more graphical visualization between the main variables.

[Figure: ggpairs plot of hr]

Bivariate Analysis

Discuss some of the relationships observed in this part of the research. How did the attributes of interest vary in the data set?

Among those who left we found two distinct groups, those who worked longer than the average and, surprisingly, those who worked for fewer hours per month.

We have records of people who worked between 2 and 10 years in the company, but those who left had completed at most 6 years there. Among those who stayed, we can see a good level of satisfaction for those who have been in the company for more than 7 years.

Among those who left, we found the lowest levels of satisfaction for employees who worked for 4 or 5 years in the company.

Most of those who had an accident at work remained in the company; the correlation between these variables is weak, at -0.1546.

In all departments we had people who worked on 7 projects, the maximum recorded. People who worked on 7 projects worked more hours in the month, compared with the average of 201 hours per month. Number_project and average_montly_hours have the strongest correlation of the group, r = 0.4172.

When analyzing the salary range by department, we observed that most of those who left received low or medium salaries. High salaries are the least common range among them in all departments. No departures of people with high salaries were recorded in the Management department.

Have you noticed any interesting relationships between the other attributes (those that are not of interest)?

The strongest relationship was found between the number of projects and average hours worked. It is not clear whether the number of projects corresponds to a period: are they the projects worked in the month, in the year, or since hiring? Assuming it is monthly, it would make sense that whoever worked on more projects may consequently have had to work more hours in the month. Knowing the complexity of the projects would help us interpret the relationship between the number of projects and the time required to carry them out.

What was the strongest relationship found?

For the purpose of this analysis, which is to investigate why some employees left the company, satisfaction level is the variable with the strongest correlation with left, the departure indicator. The correlation value was -0.388375, so satisfaction level accounts for about 15% of the variance in left (r² ≈ 0.15), but that does not mean that it is the cause of the departures.
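The share of variance associated with a single predictor is the square of the Pearson correlation, not the correlation itself. The sketch below squares the reported value and then checks the identity between r² and the R² of a one-predictor linear model on simulated, purely hypothetical data.

```r
# Reported Pearson correlation between satisfaction_level and left
r_reported <- -0.388375
round(r_reported^2, 3)   # share of variance in left, as a proportion

# Hypothetical data, only to illustrate that R-squared equals r^2
# for a linear model with one predictor
set.seed(1)
x <- runif(200)                    # stand-in for a satisfaction score
y <- as.numeric(runif(200) > x)    # binary outcome loosely tied to x

r_toy <- cor(x, y)
fit <- lm(y ~ x)

all.equal(summary(fit)$r.squared, r_toy^2)
```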

Multivariate Graphics Section

Once again we were able to visualize a different data distribution for those who left the company. They are polarized into two well-defined groups, working either above or below the average. In the Sales and Support departments the medium salary range is more common, but in the group that left this range does not prevail: most of them received a low salary, even when they worked more hours.

In the Management department, those who left also received low or medium salaries.

In this analysis we are assuming that people have the same position within the department, so that the comparison of the salary range is not meaningless.

We could not see a relationship between working longer hours and earning more for it, but did those who were involved in more projects earn more?

This chart brings new and relevant information: only people who left worked on 7 projects, one more piece of evidence that those who left were overwhelmed. Those who left the company after working on 2 projects, regardless of salary range, predominantly received low scores in their evaluations. People with more than 4 projects tended to receive better scores in their evaluations.

If we put employee satisfaction in focus, do we see a graph similar to the one above?

We have people who left with the most diverse levels of satisfaction, the proportion of people by satisfaction range and numbers of projects can help us to understand these values.

##               satisfied
## number_project          NO     REGULAR         YES
##              2 0.479897894 0.513082323 0.007019783
##              3 0.194444444 0.388888889 0.416666667
##              4 0.078239609 0.078239609 0.843520782
##              5 0.091503268 0.031045752 0.877450980
##              6 0.961832061 0.022900763 0.015267176
##              7 0.980468750 0.019531250 0.000000000

This table shows the proportion of each satisfaction band among those who left, by number of projects. We see the polarization again: people with 2 projects showed low to regular satisfaction, while those with 6 or 7 projects were almost all dissatisfied (96% and 98%). Employees who worked on 4 or 5 projects had predominantly high satisfaction (84% and 88%).

We will see below if any one department has employees with more work load than others.

From the 6th project onward there is an increase in the average monthly hours worked in all departments. People with 2 projects, in turn, worked fewer hours, also in all departments.

Why are the good and the satisfied leaving us?

To answer this question, the first step is to create a dataset with employee data that meet the criteria:

  1. In the last evaluation they had a score higher than 0.7;
  2. Worked in the company for more than 4 years;
  3. Worked on more than 5 projects;
  4. In the month, on average, they worked more than 200 hours.
## [1] 961

The hr_func_bons data.frame was created with 961 records.
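The code that built hr_func_bons is not shown, so the following is only a sketch of how the four criteria listed above could be applied with `subset()`. It uses a hypothetical mini data set and assumes every threshold is a strict "greater than", as worded in the list.

```r
# Hypothetical mini data set with the same column names as hr
toy <- data.frame(
  last_evaluation      = c(0.90, 0.85, 0.60, 0.95, 0.75),
  time_spend_company   = c(5, 6, 5, 3, 5),
  number_project       = c(6, 7, 6, 6, 4),
  average_montly_hours = c(250, 260, 240, 270, 230)
)

# Keep only rows that meet all four criteria at once
good <- subset(toy,
               last_evaluation > 0.7 &
                 time_spend_company > 4 &
                 number_project > 5 &
                 average_montly_hours > 200)

# Number of employees selected
nrow(good)
```

Changing any of the comparisons to "greater than or equal to" would change which rows sit on the boundary, so the exact operators matter when reproducing a count such as 961.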

It is not surprising that some people decided to leave the company: they worked more for the same salary range. Of course, for a more effective analysis of these variables, it would be necessary to know, for example, whether the values within the high salary range are the same, or whether those who worked the most received a larger amount even though it fell in the same range.

This is the same information as the previous chart, now divided by department. The “good” employees only received high salaries in the sales and product_mng departments but, as we already know, they received the same salary range for more hours of work.

Is the employee’s dissatisfaction perceptible?

What I want to know is whether the employee’s perception of himself, that is, his level of satisfaction, relates in some way to how the employer perceives the employee and scores him in the evaluation.

Among those who left the company, we have three very distinct groups:

  1. Unsatisfied, but recognized as good by the company; In this group we have a clearly negative association between these two variables.
  2. Unsatisfied and unrecognized, that is, they scored low on the evaluation;
  3. Satisfied and recognized as good employee.

Receiving a bad grade in the assessment is not a common attribute among those who left.

Overworked, was it a negative factor?

It is most interesting that among those who left, the split among the dissatisfied is very clear. This view makes the imbalance in hours worked evident, though without explaining its cause. Answering the initial question, the dissatisfied experienced two situations: some worked more than the monthly average of 200 hours for this variable, and some worked well below it. As the hours worked approached the range of 200 to 250, satisfaction increased.

The same chart for the data set of the “good employees” shows that those who left worked more hours on average than those who still work in the company, whose points in the graph are more dispersed.

Let’s print this same chart for the employees with lower performance.

In this group we see that those who remained in the company showed a positive trend in satisfaction, staying between 0.5 and 1. Those who left were more dissatisfied, perhaps because they did not reach their full potential or were not involved in many projects. As we are talking about human behavior, it is difficult to infer the real motivation without knowing more variables.

## [1] "Total average employees Left:"
## [1] 1526
## [1] "Total good employees Left:"
## [1] 851
## Aggregation requires fun.aggregate: length used as default
Satisfaction   Remain   Left
NO                 59    845
REGULAR            28      3
YES                23      3

Once again we confirm that most of the good employees who left were unsatisfied.

To broaden the context, we will compare the totals per satisfaction band for both those who left and those who remain.

table(hr$satisfied)
## 
##      NO REGULAR     YES 
##    3124    5577    6298
by(hr$left==1, hr$satisfied, summary)
## hr$satisfied: NO
##    Mode   FALSE    TRUE    NA's 
## logical    1389    1735       0 
## -------------------------------------------------------- 
## hr$satisfied: REGULAR
##    Mode   FALSE    TRUE    NA's 
## logical    4674     903       0 
## -------------------------------------------------------- 
## hr$satisfied: YES
##    Mode   FALSE    TRUE    NA's 
## logical    5365     933       0

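As the satisfaction band decreases, the share of employees who left increases and the share of those still employed decreases: roughly 15% left in the YES band (933 of 6298), 16% in REGULAR (903 of 5577) and 56% in NO (1735 of 3124).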
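
The comparison is easier to read as row proportions. A minimal sketch, re-entering by hand the counts printed in the by() output (the object names `counts` and `leave_rate` are introduced here for illustration only):

```r
# Counts copied from the by() output: rows are satisfaction bands,
# columns are employees who remained and employees who left.
counts <- matrix(
  c(1389, 1735,
    4674,  903,
    5365,  933),
  nrow = 3, byrow = TRUE,
  dimnames = list(satisfied = c("NO", "REGULAR", "YES"),
                  status    = c("Remain", "Left"))
)

# Row-wise proportions: share of each band that remained or left.
leave_rate <- round(prop.table(counts, margin = 1), 3)
print(leave_rate)
```

On the real data frame the same table could presumably be obtained with prop.table(table(hr$satisfied, hr$left), 1), assuming left is still coded 0/1 at that point.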

We have already explored the data and drawn some conclusions about the relationships among its variables. Now let’s create a linear model with left as the response variable.

m1 <- lm(left ~ satisfaction_level, data = hr)
m2 <- update(m1, ~ . + time_spend_company)
m3 <- update(m2, ~ . + average_montly_hours)
m4 <- update(m3, ~ . + number_project)
mtable(m1, m2, m3, m4)
## 
## Calls:
## m1: lm(formula = left ~ satisfaction_level, data = hr)
## m2: lm(formula = left ~ satisfaction_level + time_spend_company, 
##     data = hr)
## m3: lm(formula = left ~ satisfaction_level + time_spend_company + 
##     average_montly_hours, data = hr)
## m4: lm(formula = left ~ satisfaction_level + time_spend_company + 
##     average_montly_hours + number_project, data = hr)
## 
## ====================================================================
##                            m1         m2         m3         m4      
## --------------------------------------------------------------------
##   (Intercept)            0.646***   0.526***   0.445***   0.501***  
##                         (0.009)    (0.012)    (0.017)    (0.018)    
##   satisfaction_level    -0.665***  -0.647***  -0.646***  -0.665***  
##                         (0.013)    (0.013)    (0.013)    (0.013)    
##   time_spend_company                0.031***   0.029***   0.033***  
##                                    (0.002)    (0.002)    (0.002)    
##   average_montly_hours                         0.000***   0.001***  
##                                               (0.000)    (0.000)    
##   number_project                                         -0.031***  
##                                                          (0.003)    
## --------------------------------------------------------------------
##   R-squared                  0.2        0.2        0.2        0.2   
##   adj. R-squared             0.2        0.2        0.2        0.2   
##   sigma                      0.4        0.4        0.4        0.4   
##   F                       2663.9     1450.7      985.3      773.5   
##   p                          0.0        0.0        0.0        0.0   
##   Log-likelihood         -7254.4    -7154.2    -7131.3    -7073.6   
##   Deviance                2310.4     2279.7     2272.8     2255.4   
##   AIC                    14514.8    14316.3    14272.6    14159.3   
##   BIC                    14537.7    14346.8    14310.7    14205.0   
##   N                      14999      14999      14999      14999     
## ====================================================================

All four models, which add satisfaction_level, time_spend_company, average_montly_hours and number_project in turn, reach the same R² of 0.2 (to one decimal place). Explaining roughly 20% of the variance of left is a weak result and below my expectation.

At this point I know that the level of satisfaction may be related to why employees leave, so I will create another model in which this variable is the response and the others are predictors.
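
Because left only takes the values 0 and 1, a logistic regression is a common alternative to lm for this kind of response. The sketch below is not part of the original analysis: it runs on a small simulated data frame (`hr_demo`, hypothetical, with made-up effect sizes) rather than on hr, so its coefficients say nothing about the real data.

```r
set.seed(42)

# Hypothetical stand-in for hr: 500 simulated employees.
n <- 500
hr_demo <- data.frame(
  satisfaction_level = runif(n, 0.09, 1),
  time_spend_company = sample(2:10, n, replace = TRUE)
)

# Assumption: the probability of leaving falls as satisfaction rises.
p_leave <- plogis(1.5 - 4 * hr_demo$satisfaction_level +
                    0.1 * hr_demo$time_spend_company)
hr_demo$left <- rbinom(n, size = 1, prob = p_leave)

# Logistic regression: models the log-odds of leaving.
g1 <- glm(left ~ satisfaction_level + time_spend_company,
          data = hr_demo, family = binomial)

# Coefficients are on the log-odds scale; exp() turns them into odds ratios.
print(round(exp(coef(g1)), 3))
```

The fitted probabilities from such a model always stay between 0 and 1, which a straight-line fit on a 0/1 response does not guarantee.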

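m1 <- lm(satisfaction_level ~ average_montly_hours, data = hr)
m2 <- update(m1, ~ . + time_spend_company)
# Caution: the next two calls update the m3 and m4 left over from the previous
# block (response: left), not m2, as the Calls in the output below show.
m3 <- update(m3, ~ . + number_project)
m4 <- update(m4, ~ . + promotion_last_5years)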
mtable(m1, m2, m3, m4)
## 
## Calls:
## m1: lm(formula = satisfaction_level ~ average_montly_hours, data = hr)
## m2: lm(formula = satisfaction_level ~ average_montly_hours + time_spend_company, 
##     data = hr)
## m3: lm(formula = left ~ satisfaction_level + time_spend_company + 
##     average_montly_hours + number_project, data = hr)
## m4: lm(formula = left ~ satisfaction_level + time_spend_company + 
##     average_montly_hours + number_project + promotion_last_5years, 
##     data = hr)
## 
## =====================================================================
##                             m1         m2         m3         m4      
## ---------------------------------------------------------------------
##   (Intercept)             0.633***   0.680***   0.501***   0.500***  
##                          (0.008)    (0.009)    (0.018)    (0.018)    
##   average_montly_hours   -0.000*    -0.000      0.001***   0.001***  
##                          (0.000)    (0.000)    (0.000)    (0.000)    
##   time_spend_company                -0.017***   0.033***   0.034***  
##                                     (0.001)    (0.002)    (0.002)    
##   satisfaction_level                           -0.665***  -0.662***  
##                                                (0.013)    (0.013)    
##   number_project                               -0.031***  -0.031***  
##                                                (0.003)    (0.003)    
##   promotion_last_5years                                   -0.177***  
##                                                           (0.022)    
## ---------------------------------------------------------------------
##   R-squared                   0.0        0.0        0.2        0.2   
##   adj. R-squared              0.0        0.0        0.2        0.2   
##   sigma                       0.2        0.2        0.4        0.4   
##   F                           6.0       77.5      773.5      634.5   
##   p                           0.0        0.0        0.0        0.0   
##   Log-likelihood           -403.7     -329.7    -7073.6    -7041.2   
##   Deviance                  926.8      917.7     2255.4     2245.6   
##   AIC                       813.5      667.3    14159.3    14096.3   
##   BIC                       836.3      697.8    14205.0    14149.6   
##   N                       14999      14999      14999      14999     
## =====================================================================

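Note that only m1 and m2 in this table model satisfaction_level. As the Calls section shows, m3 and m4 were updated from the earlier models and still have left as the response, so their R² of 0.2 describes left, not satisfaction. For satisfaction_level itself, average monthly hours and years in the company explain almost nothing (R² of 0.0), and the contribution of the number of projects and of promotions in the last 5 years was not actually estimated here. In any case, these relationships do not imply causality.

To complement the analysis, we will build a decision tree with the rpart method. At the end of the process we will have a good view of the data.

First we need to split the data into a training sample and a test sample. The training sample is set to 80% of the original data, and the remaining 20% is used for testing. I will also relabel left: 0 = Remain and 1 = Left.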
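
To keep satisfaction_level as the response in all four models, each update() has to start from the model built in the step just before it. A minimal sketch on simulated data (`hr_demo` and the names `s1` to `s4` are hypothetical, chosen so they cannot collide with the earlier m1 to m4; the numbers it produces are not results for hr):

```r
set.seed(1)

# Hypothetical stand-in for hr with the same column names.
n <- 200
hr_demo <- data.frame(
  satisfaction_level    = runif(n, 0.09, 1),
  average_montly_hours  = round(runif(n, 96, 310)),
  time_spend_company    = sample(2:10, n, replace = TRUE),
  number_project        = sample(2:7, n, replace = TRUE),
  promotion_last_5years = rbinom(n, 1, 0.05)
)

# Each model extends the one created just before it.
s1 <- lm(satisfaction_level ~ average_montly_hours, data = hr_demo)
s2 <- update(s1, ~ . + time_spend_company)
s3 <- update(s2, ~ . + number_project)
s4 <- update(s3, ~ . + promotion_last_5years)

# Check that the response is the same in every model.
responses <- sapply(list(s1, s2, s3, s4),
                    function(m) all.vars(formula(m))[1])
print(responses)
```

Using fresh names for the second family of models is one simple way to avoid silently reusing objects from an earlier chunk.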

hr$left <- factor(hr$left, labels=c("Remain", "Left"))

# Training sample size: 80% of the rows of hr
sample_size <- floor(0.80 * nrow(hr))

# Fix the seed so the random split is reproducible
set.seed(123)

# Draw the row indices for the training sample
train_ind <- sample(seq_len(nrow(hr)), size = sample_size)

# Training sample (first 8 columns)
train <- hr[train_ind, 1:8]

# Test sample: all remaining rows
test <- hr[-train_ind, 1:8 ]
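
A quick sanity check of such a split, sketched on a hypothetical data frame `demo` that only shares its number of rows with hr (14999, the N reported in the model tables): the two samples should not overlap and together should cover every row.

```r
set.seed(123)

# Hypothetical stand-in with one id per row (not the real hr data).
demo <- data.frame(id = seq_len(14999))

sample_size <- floor(0.80 * nrow(demo))
train_ind   <- sample(seq_len(nrow(demo)), size = sample_size)

train_demo <- demo[train_ind, , drop = FALSE]
test_demo  <- demo[-train_ind, , drop = FALSE]

# Sizes of the two samples and number of ids present in both.
split_sizes <- c(train = nrow(train_demo), test = nrow(test_demo))
overlap     <- length(intersect(train_demo$id, test_demo$id))
print(split_sizes)
print(overlap)
```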

The model will be based on the variables investigated so far that we considered important during the exploratory analysis. The factor to be predicted is left, that is, whether the employee will remain or leave, and the predictors are the level of satisfaction, the score of the last evaluation, the average monthly hours worked, the number of projects and the years worked in the company.

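# Build the model (rpart() comes from the rpart package)
fit <- rpart(left ~ satisfaction_level + last_evaluation + 
                    average_montly_hours + number_project + 
                    time_spend_company,
             data=train,
             method="class")

# Predict class probabilities for the test sample
predicted_hr <- predict(fit, test)

# AUC (auc() is assumed to come from pROC): 0/1 response, probability of "Left"
auc(as.numeric(test$left) - 1, predicted_hr[, 2])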
## Area under the curve: 0.9713

Wow! The classifier had an excellent score of 0.97.
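
For readers unfamiliar with the metric: the AUC is the probability that a randomly chosen employee who left receives a higher predicted score than a randomly chosen employee who stayed. A minimal base R sketch of that definition follows. `auc_rank` is a hypothetical helper written for this illustration, not the function used above, and the four labels and scores are made up.

```r
# AUC from ranks (Mann-Whitney statistic); ties get average ranks.
auc_rank <- function(labels, scores) {
  pos <- scores[labels == 1]   # scores of the positive class ("Left")
  neg <- scores[labels == 0]   # scores of the negative class ("Remain")
  r   <- rank(c(pos, neg))
  u   <- sum(r[seq_along(pos)]) - length(pos) * (length(pos) + 1) / 2
  u / (length(pos) * length(neg))
}

# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
toy_labels <- c(0, 0, 1, 1)
toy_scores <- c(0.10, 0.40, 0.35, 0.80)
print(auc_rank(toy_labels, toy_scores))  # 0.75
```

A value of 0.5 corresponds to random ordering and 1 to a perfect separation of the two groups.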

Now that we know we have a great model, let’s plot the decision tree.

rpart.plot(fit, extra = 104, 
                box.palette="GnBu", 
                branch.lty=3, 
                shadow.col="gray", 
                nn=TRUE)

It is indeed a good model. See how it mirrors and reinforces the previous analysis: the employee with a low satisfaction level, with many projects, or satisfied but working many hours a month is more likely to leave.

Multivariate analysis

Discuss the relationships observed in this part of the research. Which attributes strengthened each other in the observation of the variables of interest?

When the level of satisfaction of the whole group is analyzed, a very different distribution is evident between those who remain in the company and those who leave. The remaining employees have a satisfaction between regular and satisfied, with a small number of dissatisfied employees. Among those who left, the dissatisfied group is the largest, which reflects the correlation observed between the two variables, but it cannot be said that dissatisfaction is the cause of, for example, a dismissal.

The monthly hours worked also played an important role in the group of those who left, where the data is commonly polarized into groups of many or few hours worked per month.

Were any surprising or interesting interactions found among the attributes?

It was a surprise to find that working fewer hours than the monthly average was a common trait of dissatisfaction among those who left.

Have models been created using this dataset? Discuss the strengths and limitations of your model.

Yes, three models were created, two linear and one decision tree.

The first linear regression model was meant to verify the relationship between left and the variables satisfaction_level, time_spend_company, average_montly_hours and number_project, which were selected based on the correlation values found between them. The result was a bit disappointing, because the adjusted R² shows that the model explains only about 20% of the variance of left. It is also worth noting that every coefficient is highly significant (marked *** in the table) and that the overall p-value is reported as 0.0 after rounding.

In the second model the goal was to relate the level of satisfaction to the other variables. If low satisfaction can be related, without being the cause, to the employee’s exit, what could be related to the low satisfaction itself? The models with average monthly hours worked and total years in the company had the lowest R², about 0. The R² of 0.2 shown for the models that add number of projects and promotion in the last 5 years does not refer to satisfaction: as the Calls in that table show, those two models still have left as the response, so their relationship with satisfaction was not actually estimated.

Following the two linear models, I broadened the analysis with a third model, a decision tree, to test how well the data can predict which employees are more likely to leave the company. I particularly like decision trees because they can be built quickly and simply, and they are effective in communicating the result. I used rpart.plot to print the sequence of decisions and illustrate what we had already observed during the exploratory analysis.


Final Charts and Summary

We started this project knowing little about the company and its employees. We explored variables that indicate work accidents, promotions, level of satisfaction, score in the last evaluation, number of projects, average monthly hours worked, salary and department. The focus throughout was to find out why people left the company, so the left variable was the centerpiece.

In addition to displaying graphs with the distribution of the variables individually and the relationships between them, correlation values were calculated and 2 linear models and a decision tree were created.

To complete the exploration of the data, 3 graphs that summarize the observations will be displayed.

First Chart

To begin, I selected the variables average_montly_hours and satisfaction_level. Of all the variables in the data set, satisfaction_level had the strongest correlation with left, at -0.38. In other charts, average_montly_hours brought relevant information, consistently showing a higher monthly average of hours worked for those who left compared with those still employed.

Those still employed worked an average of 198 hours a month, against 224 hours for those who left. The maximums recorded were 287 and 310 hours, respectively.

This chart summarizes the observations made in this project: we conclude that the people who left the company were overburdened and unsatisfied. Among them we find:

  1. The unsatisfied, who either worked less than the overall average or worked more than both the average and those who remained;
  2. Those with a good level of satisfaction, who nevertheless also worked more than the overall monthly average of 201 hours.

Second chart

The second chart displays information based on the number of projects worked. This variable played an important role during the bivariate analysis, because the data for those who left behaved differently whenever the number of projects was added to the analysis. We reproduce these observations below.

This chart leads us to the following observations:

  1. Those who left and were overloaded with 6 or 7 projects registered very low levels of satisfaction;
  2. Satisfaction is also low among those who worked on 2 projects;
  3. Among those who left, the highest satisfaction levels were found in those who worked on 4 or 5 projects.

Third chart

In the last graph the data is shown only for those who left. In chart 1 we saw the level of satisfaction and working hours, and in chart 2 we captured an overload of projects. Now we add the variable last_evaluation for a broader view of the employee, since the score given by the company in the last evaluation is important for understanding the context.

People with 2 projects show a low or medium level of satisfaction, as we have seen. Employees who worked on 4 or 5 projects received good scores in their evaluations and showed a good level of satisfaction. Those who worked on 6 or 7 projects also received a good evaluation from the company, but they were definitely not satisfied.

Even if the company recognizes the employee’s work through a good grade in his assessment, he is likely to leave the company if he is feeling overwhelmed and dissatisfied.


Reflection

It is always very exciting to investigate a database looking for factors that can help an organization retain its talent and keep a top-notch team. The limitation of this analysis is that it is based on human behavior, and the few variables in the data set are not enough to deliver a consistent result that the organization can act on. It would be very useful for this analysis to have information about the positions, the level of satisfaction per year of work, the complexity of the projects worked on and even the hiring regime.

With the available information it was possible to identify how much the workload is a determining factor in the employee’s exit. Even if he received a good score in the evaluation and is satisfied, he will probably leave if he is overloaded with projects and hours of work. And if he is already a little dissatisfied and the company assigns him many more or fewer hours than the overall average, it is possible that he will quit as well.

I would like to have more information on the monetary value of the salary. Understanding whether pay can be a factor of dissatisfaction is difficult when you do not know the job and the actual salary. What can be perceived regarding the recognition of work is that very few promotions were registered: even the proportion of work accidents is higher than that of promotions. Among those who left, few people had a high salary.

Although I did not have all the variables that I consider important, it was fun to investigate the questions proposed at the beginning of this project. I would say that creating a decision tree and getting such a high AUC was one of the most rewarding points in the process.

In future work it would be interesting to use a random forest as the predictor. Its more complex decision algorithm would make a good comparison with the decision tree model used in this project.